Abstract. This paper continues our earlier investigations into the inversion
of random functions in a general (abstract) setting. In Section 2 we investigate
a concept of invertibility and the invertibility of the composition of random
functions. In Section 3 we resolve some questions concerning the number of
samples required to ensure the accuracy of parametric maximum likelihood
estimation (MLE). A direct application to phylogeny reconstruction is given.



1. Review of random functions

This paper is a sequel to our earlier papers [11, 12]. We assume that the reader
is familiar with those papers; however, we repeat the most important definitions.

For two finite sets, $A$ and $U$, let us be given a $U$-valued random variable $\xi_a$ for
every $a \in A$. We call the vector of random variables $(\xi_a : a \in A)$ a random function
$\Xi : A \to U$. Ordinary functions are specific instances of random functions.

Given another random function, $\Gamma$, from $U$ to $V$, we can speak about the
composition of $\Gamma$ and $\Xi$, $\Gamma \circ \Xi : A \to V$, which is the vector variable
$(\gamma_{\xi_a} : a \in A)$. In this paper we are concerned with inverting random functions.
In other words, we look for random functions $\Gamma : U \to A$ in order to obtain the best
approximations of the identity function $\iota : A \to A$ by $\Gamma \circ \Xi$. We always assume
that $\Xi$ and $\Gamma$ are independent. This assumption holds automatically if either $\Xi$ or
$\Gamma$ is a deterministic function.

Consider the probability of returning $a$ from $a$ by the composition of two random
functions, that is, $r_a = P[\gamma_{\xi_a} = a]$. The assumption on the independence of $\Xi$ and
$\Gamma$ immediately implies

(1) $r_a = \sum_{u \in U} P[\xi_a = u] \cdot P[\gamma_u = a]$.

A natural criterion is to find $\Gamma$ for a given $\Xi$ in order to maximize $\sum_a r_a$. More
generally, we may have a weight function $w : A \to \mathbb{R}^+$ and we may wish to maximize
$\sum_a r_a w(a)$. This can happen if we give preference to returning certain $a$'s, or,
if we have a prior probability distribution on A and we want to maximize the 
expected return probability for a random element of A selected according to the 
prior distribution. The random function $\Gamma^* : U \to A$ defined below will
do this job: for any fixed $u \in U$,

(2) $\gamma^*_u = a^*$ for sure, if for all $a \in A$, $P[\xi_{a^*} = u]\, w(a^*) \geq P[\xi_a = u]\, w(a)$.

(In case there is more than one element $a^*$ that satisfies (2), we may select uniformly
at random from the set of such elements.) This function $\Gamma^*$ is called the maximum
a posteriori estimator (MAP) in the literature p]. The special case when the weight
function $w$ is constant is known as maximum likelihood estimation (MLE).
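
As an illustration of rule (2), the following minimal sketch computes a MAP inverse for a small random function given by its matrix of probabilities $P[\xi_a = u]$. The matrix `X`, the weights and the function names are hypothetical choices made for this example only, and ties are broken by taking the first maximizer rather than by a uniform random choice.

```python
from fractions import Fraction

def map_estimator(X, w):
    """For each u, pick an a* maximizing P[xi_a = u] * w(a), as in rule (2).

    X[a][u] = P[xi_a = u]; w[a] is the weight of a.
    Ties go to the first maximizer (a simplification of the uniform choice).
    """
    n_u = len(X[0])
    return [max(range(len(X)), key=lambda a: X[a][u] * w[a]) for u in range(n_u)]

def return_probabilities(X, gamma):
    """r_a = sum over u with gamma[u] == a of P[xi_a = u] (deterministic gamma)."""
    return [sum(X[a][u] for u in range(len(gamma)) if gamma[u] == a)
            for a in range(len(X))]

# Hypothetical random function with |A| = 2 and |U| = 3.
X = [[Fraction(1, 2), Fraction(1, 2), Fraction(0)],
     [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]]
w_const = [Fraction(1), Fraction(1)]         # constant weights: MLE
w_prior = [Fraction(1, 5), Fraction(4, 5)]   # a prior favouring the second element

gamma_mle = map_estimator(X, w_const)
gamma_map = map_estimator(X, w_prior)
r_mle = return_probabilities(X, gamma_mle)
r_map = return_probabilities(X, gamma_map)

def weighted_return(r, w):
    """The objective sum_a r_a w(a)."""
    return sum(ra * wa for ra, wa in zip(r, w))
```

Under the prior weights the MAP rule gives up on the first element entirely, yet it attains a larger weighted return probability than MLE, which is the trade-off the weight function expresses.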

For $a, b \in A$ and $\Xi : A \to U$, let

$d(a, b) = \sum_{u \in U} |P[\xi_a = u] - P[\xi_b = u]|$,

which is called the variational distance of the random variables $\xi_a$ and $\xi_b$.

A given $\Xi : A \to U$ has an associated $|A| \times |U|$ matrix $X$, such that
$x_{a,u} = P[\xi_a = u]$. Given a $\Gamma : U \to V$ with associated $|U| \times |V|$ matrix $G$, the
composition of $\Gamma$ and $\Xi$, $\Gamma \circ \Xi : A \to V$, has the associated matrix $XG$.
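
A small sketch of these two notions, assuming the matrix of $\Gamma$ is stored with rows indexed by $U$ and columns by $V$, so that the composition corresponds to the ordinary matrix product. The matrices and function names are hypothetical.

```python
from fractions import Fraction as F

def variational_distance(X, a, b):
    """d(a, b) = sum_u |P[xi_a = u] - P[xi_b = u]| (no factor 1/2)."""
    return sum(abs(x - y) for x, y in zip(X[a], X[b]))

def compose(X, G):
    """Matrix of the composition: product of the |A| x |U| matrix X
    with the |U| x |V| matrix G (rows of both are probability vectors)."""
    return [[sum(X[a][u] * G[u][v] for u in range(len(G)))
             for v in range(len(G[0]))]
            for a in range(len(X))]

X = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 4), F(1, 4), F(1, 2)]]
# A hypothetical inverse Gamma : U -> A, so here V = A.
G = [[F(1), F(0)],
     [F(1, 2), F(1, 2)],
     [F(0), F(1)]]

XG = compose(X, G)
r = [XG[a][a] for a in range(len(X))]   # return probabilities r_a on the diagonal
dist = variational_distance(X, 0, 1)
```

When $V = A$, the diagonal of the product holds the return probabilities $r_a$, and each row of the product is again a probability vector.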

Our motivation for the study of random functions came from phylogeny recon- 
struction [SJ|S]. Stochastic models define how biomolecular sequences are generated 
at the leaves of a binary tree. If all possible binary trees on n leaves come equipped 
with a model for generating biomolecular sequences of length k, then we have a 
random function from the set of binary trees with n leaves to the ordered n-tuples 
of biomolecular sequences of length k. Phylogeny reconstruction can be viewed as 
a random function from the set of ordered n-tuples of biomolecular sequences of 
length k to the set of binary trees with n leaves. It is a natural assumption that 
random mutations in the past are independent from any random choices in the phy- 
logeny reconstruction algorithm. Criteria for phylogeny reconstruction may differ 
according to what one wishes to optimize. However, in the practice of phylogeny 
reconstruction there are no fixed, preconceived models on the possible trees; instead,
we also try to estimate the model parameters. Our paper [11] introduced
a new abstract model for phylogeny reconstruction: inverting parametric random 
functions. Most of the work done on the mathematics of phylogeny reconstruction 
can be discussed in this context. This model is more structured than random func- 
tions, and hence is better suited to describe details of models of phylogeny and the 
evolution of biomolecular sequences. 

Assume that for a finite set $A$, for every $a \in A$, an (arbitrary, finite or infinite)
set $\Theta(a) \neq \emptyset$ is assigned, and moreover, $\Theta(a) \cap \Theta(b) = \emptyset$ for $a \neq b$. Set
$B = \{(a, \theta) : a \in A, \theta \in \Theta(a)\}$ and let $\pi_1$ denote the natural projection from $B$ to $A$.
A parametric random function is the collection $\Xi$ of random variables such that

(3) for $a \in A$ and $\theta \in \Theta(a)$, there is a (unique) $U$-valued random variable $\xi_{(a,\theta)}$ in $\Xi$.

Figure 1. Inversion of non-parametric (a), and parametric (b) random functions



We are interested in random functions $\Gamma : U \to A$ independent from $\Xi$ so
that $\gamma_{\xi_{(a,\theta)}}$ best approximates $\pi_1$ under certain criteria. Call $R_{(a,\theta)}$ the
probability $P[\gamma_{\xi_{(a,\theta)}} = a]$. Maximum Likelihood Estimation, as it is used in situations
where there is a discrete parameter of interest to estimate in the presence of other
parameters (such as phylogeny reconstruction), would take the $\Gamma'$ for which, for
every fixed $u$, $\gamma'_u = a'$ for sure, if

(4) $\forall (a, \theta) \in B \;\; \exists \theta' \in \Theta(a') : \quad P[\xi_{(a',\theta')} = u] \geq P[\xi_{(a,\theta)} = u]$.

In case there is more than one element $a'$ that satisfies (4), we may select uniformly
at random from the set of such elements. (We avoided using the more natural
looking quantification $\exists \theta' \in \Theta(a') \; \forall (a, \theta) \in B$, since $P[\xi_{(a',\theta')} = u]$ may not take a
maximum value!) We denote by $R'_{(a,\theta)}$ the probability that from the pair $(a, \theta)$ the
Maximum Likelihood Estimation $\Gamma'$ returns $a$, i.e.

(5) $R'_{(a,\theta)} = P[\gamma'_{\xi_{(a,\theta)}} = a]$.

If a random function $\Xi : A \to U$ ($\Xi : B \to U$) is to have $k$ independent
evaluations, we denote the resulting random function by $\Xi^{(k)} : A \to U^k$
($\Xi^{(k)} : B \to U^k$), and the random variable associated with $a$ will be $\xi^{(k)}_a$. We will study
the invertibility of $\Xi^{(k)}$ both in the non-parametric and the parametric setting. For
a random function $\Gamma : U^k \to A$, we use the notation $r^{(k)}_a = P[\gamma_{\xi^{(k)}_a} = a]$ in the
non-parametric case, $R^{(k)}_{(a,\theta)} = P[\gamma_{\xi^{(k)}_{(a,\theta)}} = a]$ in the parametric case, and $[R^{(k)}]'_{(a,\theta)}$
if $\Gamma'$ is the Maximum Likelihood Estimation.

In Section 2 we will show that in the non-parametric setting several natural definitions
of invertibility of a random function are, in fact, equivalent. Furthermore,
we determine when composition of invertible random functions is invertible. The 
main result of this Section is an explicit bound on how invertibility "improves" as 
the variational distances between elements of A have increasing separation from 
zero.

In Section 3 we revisit our study of the worst-case behavior of MLE in [12]. (This
is a very natural question in situations where a prior distribution is not given on 
A, or the inverting of the random function is to be carried out only once. Such 
a situation arises in phylogeny reconstruction, where, arguably, we do not have a 
prior distribution on alternative evolutionary scenarios, and the reconstruction is 
not going to be repeated — there is only one 'Tree of Life' that we want to know.) A 
certain amount of controversy and debate has surrounded the statistical consistency 
of MLE in phylogeny, as described in [5], pp. 270-272. Felsenstein's claim (from
the early 1970s) of the consistency of MLE in phylogeny for simple ('identifiable')
models is correct, but it was only formally established in 1996 by [2]. This result, like
Wald's earlier result [14], relies on a compactness argument, continuity, and limit
theory, and does not give an explicit bound on $k$. Other proofs in the biological
literature have generally been less rigorous and have led to criticism and debate (see
e.g. H 13 [TUJ HS1 CHI). One oversight has been to treat the MLE-estimated
continuous parameters (branch lengths) of alternative trees as fixed rather than as
random variables dependent on the data; such arguments are satisfying for practical
purposes but call for more rigor. The significance of Theorem 5.1 of [12] is that it gives
the first explicit bounds for MLE, both in the phylogenetic setting and beyond.
However, this result depended on an unnatural parameter, namely the smallest
positive probability that an image of the object to be reconstructed can have. Here,
in Theorem 3.3, we remove this dependence, and provide a simple and immediate
application of this new result to phylogeny reconstruction.

We study two examples that show how subtle MLE is for inverting parametric
random functions. The first example shows that Theorem 3.3 is "near optimal"
in one of its parameters. The second example shows that in contrast to the non- 
parametric setting, the vanishing of variational distance does not by itself preclude 
MLE (or other) estimation for certain random functions. 

Our approach is information-theoretic: we focus on the possibility or impossibility
of inverting random functions, and not on issues of computational complexity.
Our results can also be restated in the language of decision theory, by talking
about 'loss functions' and 'risk functions' associated with the decision rule.

2. Invertibility in the non-parametric setting

Let us say that a random function $\Xi : A \to U$ is invertible if there exists a random
function $\Gamma : U \to A$ such that for all $a \in A$, $P[\gamma_{\xi_a} = x]$ takes its strict maximum when
$x = a$, or equivalently,

(6) $P[\gamma_{\xi_a} = a] - \max_{x \neq a} \{P[\gamma_{\xi_a} = x]\} > 0$ for all $a \in A$.

Informally, $\Xi$ is invertible if there is some reconstruction method that is always
more likely to pick the generating object in $A$ than any other element of $A$.

A sufficient condition for $\Xi$ to be invertible is that there exists a $\Gamma$ so that for
all $a \in A$, the following two conditions apply:

($I_1$) $P[\gamma_{\xi_a} = a] > \frac{1}{|A|}$,
($I_2$) $P[\gamma_{\xi_a} = b] < \frac{1}{|A|}$, for all $b \neq a$.

Note that invertibility implies ($I_1$), and is equivalent to it when $|A| = 2$, but not
equivalent for $|A| \geq 3$.

We say $\Xi$ separates $A$ if, for each distinct pair $a, b \in A$, the variational distance
$d(a, b)$ of the probability distributions of $\xi_a$ and $\xi_b$ is strictly positive.

Proposition 2.1. The following properties are equivalent for a random function
$\Xi : A \to U$:

(i) $\Xi$ separates $A$;

(ii) for all $\epsilon > 0$ there is a value $k_\epsilon$ so that for all $k \geq k_\epsilon$ there is a random
function $\Gamma : U^k \to A$ for which $P[\gamma_{\xi^{(k)}_a} = a] > 1 - \epsilon$;

(iii) $\Xi$ is invertible;

(iv) for some $k \geq 1$, $\Xi^{(k)}$ is invertible.

Proof. The equivalence between (i) and (ii) follows easily from results in our earlier
papers [11] and [12] and standard arguments. We will show that (iv) $\Rightarrow$ (ii) and
that (i) $\Rightarrow$ (iii). Since (iii) $\Rightarrow$ (iv) is trivial, this will establish the claimed four-way
equivalence.

Proof of (iv) $\Rightarrow$ (ii). Suppose that $\Xi^{(k)}$ is invertible. Select $\Gamma$ to satisfy (6) for
$\Xi^{(k)}$. For a positive integer $m$, generate $km$ independent samples in $U$ according to $\Xi$.
Define $\Gamma_m : U^{km} \to A$ as follows: apply $\Gamma$ to each of the $m$ blocks of $k$ samples, select
the elements of $A$ that are reconstructed most often, and choose one of them uniformly
at random. By standard probability arguments, the probability that the correct element
$a$ will be selected by this process converges to 1 as $m$ tends to infinity.
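
The amplification step can be made concrete in the simplest case. The sketch below assumes $|A| = 2$ and that each block is reconstructed correctly with some fixed probability `p` greater than 1/2 (both are simplifying assumptions for illustration), and computes exactly the probability that the most frequently reconstructed element is the correct one.

```python
from math import comb

def plurality_success(p, m):
    """P[the correct element wins a vote among m independent reconstructions]
    when |A| = 2 and each reconstruction is correct with probability p.
    A tie is resolved by a fair coin, matching the uniform choice in the proof."""
    total = 0.0
    for j in range(m + 1):                      # j = number of correct reconstructions
        prob = comb(m, j) * p ** j * (1 - p) ** (m - j)
        if 2 * j > m:
            total += prob
        elif 2 * j == m:
            total += prob / 2
    return total

# Even a slight edge per block is amplified as the number of blocks grows.
probs = [plurality_success(0.55, m) for m in (1, 11, 101, 501)]
```

For general $|A|$ the same idea applies with a multinomial vote count; only the bookkeeping changes.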

Proof of (i) $\Rightarrow$ (iii). Suppose that $\Xi : A \to U$ separates $A$. Let $X$ denote the
associated matrix of $\Xi$, and let $\mathbf{a}_i$, $i \in A$, denote the rows of $X$. Recall that $\mathbf{a}_i$ gives
the distribution of $\xi_i$. We will describe the inverse random function $\Gamma : U \to A$
with its associated matrix, i.e. in the form of a $|U| \times |A|$ matrix $G$, whose rows
represent the distribution of the image of the element of $U$ corresponding to the row.

We write $G = V + \frac{1}{|A|} J$, where $J$ is the all-ones matrix, and will give $V$ explicitly.
(If we were to take $V = 0$, then (6) yields uniformly $= 0$ instead of the desired $> 0$.)
We denote the columns of $V$ by $\mathbf{v}_i$, $i \in A$. We define each vector $\mathbf{v}_i$ as follows:



where $|\cdot|$ is the usual Euclidean vector norm. Then it can be checked that this choice
of $V$ provides a solution to the following system:

$\forall i, j,\ i \neq j: \quad \mathbf{a}_i \cdot \mathbf{v}_i - \mathbf{a}_i \cdot \mathbf{v}_j - \epsilon_{ij} = 0$;

$\sum_{i=1}^{|A|} \mathbf{v}_i = \mathbf{0}$;

$\forall i, j,\ i \neq j: \quad \epsilon_{ij} > 0$,

and these are precisely the conditions (6) requires for invertibility. □

2.1. Composition of invertible functions. A natural question is whether the 
composition of invertible functions is also invertible. The next result shows that in 
general the answer is 'no', though we can provide a precise characterization based 
on the rank of an associated matrix. 

Theorem 2.2. Let $\Upsilon : U \to Z$ be a random function with associated matrix $Y$, and let
$Y^+$ denote the extension of $Y$ by an all-1 row. If $\mathrm{rank}(Y^+) = |U|$, then for all invertible
random functions $\Xi : A \to U$, the composition $\Upsilon \circ \Xi : A \to Z$ is invertible; and if the
rank is less than $|U|$, then there exist invertible random functions $\Xi : A \to U$ such
that $\Upsilon \circ \Xi : A \to Z$ is not invertible.



Proof. Assume first that $\Upsilon \circ \Xi$ is not invertible, i.e. there exist $a \neq b \in A$ such that
the distributions of $\upsilon_{\xi_a}$ and $\upsilon_{\xi_b}$ are identical. Then we have the following homogeneous
system of linear equations, where the coefficients are the numbers $P[\upsilon_u = z]$ and
1's, and the variables are the $x_u$'s:

(7) $\sum_{u \in U} P[\upsilon_u = z]\, x_u = 0$ for all $z \in Z$,

(8) $\sum_{u \in U} x_u = 0$.

The matrix $Y^+$ is the matrix of the system of homogeneous linear equations (7)-(8).
Observe that $x_u = P[\xi_a = u] - P[\xi_b = u]$ solves the system (7)-(8). If the rank of
$Y^+$ is $|U|$, then the system has only the trivial solution, i.e. $x_u = 0$ for all $u \in U$. This
amounts to $\xi_a$ and $\xi_b$ having the same distribution, contrary to the assumption of $\Xi$ being
invertible.

Assume now that $Y^+$ has rank less than $|U|$. Then the system (7)-(8) has a
non-trivial solution $x_u$. Set $P = \sum_{u : x_u > 0} x_u$ and $N = \sum_{u : x_u < 0} x_u$. Clearly
$P = -N > 0$. Take $A = \{a, b\}$, $P[\xi_a = u] = x_u / P$ if $x_u > 0$, and 0 otherwise; and
$P[\xi_b = u] = x_u / N$ if $x_u < 0$, and 0 otherwise. It is clear that this $\Xi$ is invertible, as it
separates $a$ and $b$. However, according to the argument above, the distributions of
$\upsilon_{\xi_a}$ and $\upsilon_{\xi_b}$ are identical. □
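
The rank criterion is easy to test numerically. The sketch below is an illustration under stated assumptions: the random function into $Z$ is stored as a matrix `Y` with `Y[u][z]` the probability that $u$ is mapped to $z$, and the matrix whose rank is checked is the coefficient matrix of the system (7)-(8), with one row per $z$ plus an all-1 row. The helper names and both example matrices are hypothetical.

```python
from fractions import Fraction as F

def rank(rows):
    """Rank of a rational matrix by Gaussian elimination."""
    m = [[F(x) for x in r] for r in rows]
    rk = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(len(m)):
            if i != rk and m[i][col] != 0:
                factor = m[i][col] / m[rk][col]
                m[i] = [x - factor * y for x, y in zip(m[i], m[rk])]
        rk += 1
    return rk

def system_matrix(Y):
    """Coefficient matrix of (7)-(8): rows indexed by z, then an all-1 row."""
    n_u, n_z = len(Y), len(Y[0])
    return [[Y[u][z] for u in range(n_u)] for z in range(n_z)] + [[1] * n_u]

def push_forward(xi, Y):
    """Distribution of the image of a U-valued variable with distribution xi."""
    return [sum(xi[u] * Y[u][z] for u in range(len(Y))) for z in range(len(Y[0]))]

Y_good = [[F(3, 4), F(1, 4)], [F(1, 4), F(3, 4)]]   # |U| = 2, full rank
Y_bad = [[F(1), F(0)], [F(0), F(1)], [F(1, 2), F(1, 2)]]   # |U| = 3, rank 2

# Non-trivial solution x = (1, 1, -2) of (7)-(8) for Y_bad, turned into two
# separated distributions as in the second half of the proof.
xi_a = [F(1, 2), F(1, 2), F(0)]
xi_b = [F(0), F(0), F(1)]
image_a, image_b = push_forward(xi_a, Y_bad), push_forward(xi_b, Y_bad)
```

For `Y_bad` the two source distributions have disjoint supports, yet their images coincide, so the composition cannot be inverted.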



2.2. Explicit bounds. From Proposition 2.1, if $\Xi$ separates $A$ then there is a
random function $\Gamma : U \to A$ for which

$P[\gamma_{\xi_a} = a] - \frac{1}{|A|} > 0$.

We now consider putting an explicit lower bound on the right hand side of this
inequality. That is, we show that for a specific continuous positive function
$h : \mathbb{R} \to \mathbb{R}$ (dependent only on $|A|$) the following holds: Suppose that $d(a, b) \geq \delta$ for
all $a, b \in A$, $a \neq b$. Then there is a random function $\Gamma : U \to A$ for which

$P[\gamma_{\xi_a} = a] - \frac{1}{|A|} \geq h(\delta)$

for all $a \in A$. Note that we cannot insist that $\Gamma$ be MLE (maximum likelihood
estimation), even when $|A| = 2$. To see this, let $A = \{1, 2\}$, $U = \{u_1, u_2\}$ and
let $\xi_1$ take the value $u_1$ with probability 1, and let $\xi_2$ take the values $u_1, u_2$ with
probabilities $\frac12$ and $\frac12$, respectively; then if $\Gamma = \Gamma^*$ is MLE, we have $P[\gamma^*_{\xi_2} = 2] = \frac12$.
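
The example can be checked directly. The sketch below reads the example as $\xi_2$ being uniform on $\{u_1, u_2\}$ (an assumption about the intended probabilities) and compares MLE with a randomized inverse `G_mix`, which is a hypothetical alternative introduced here only for comparison.

```python
from fractions import Fraction as F

X = [[F(1), F(0)],          # xi_1: u_1 with probability 1
     [F(1, 2), F(1, 2)]]    # xi_2: u_1 and u_2 with probability 1/2 each

def return_probs(X, G):
    """r_a = sum_u P[xi_a = u] * P[gamma_u = a], with G[u][a] = P[gamma_u = a]."""
    return [sum(X[a][u] * G[u][a] for u in range(len(G))) for a in range(len(X))]

G_mle = [[F(1), F(0)],      # on u_1 MLE returns 1, since 1 > 1/2
         [F(0), F(1)]]      # on u_2 MLE returns 2
G_mix = [[F(2, 3), F(1, 3)],  # on u_1 return 1 only with probability 2/3
         [F(0), F(1)]]

r_mle = return_probs(X, G_mle)
r_mix = return_probs(X, G_mix)
```

MLE leaves the second element with return probability exactly 1/2, while the randomized inverse lifts both return probabilities strictly above 1/2.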

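Theorem 2.3. For every random function $\Xi : A \to U$, with $|A| > 1$, there exists a
$\Gamma : U \to A$, such that

$\min_{a \in A} r_a \geq \frac{1}{|A|} + \frac{1}{2|A|(|A|-1)} \min_{a \in A} \sum_{b \in A} d(a, b)$.

In particular, if for all $a \neq b \in A$, $d(a, b) \geq \delta$, then $\min_{a \in A} r_a \geq \frac{1}{|A|} + \frac{\delta}{2|A|}$.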



Proof. Recall the characterization of the random inverse function maximizing
$\min_{a \in A} r_a$ from Theorem 5 of [11]: $\min_{a \in A} r_a = \min_\mu \sum_{u \in U} \max_{a \in A} \mu(a) P[\xi_a = u]$,
where $\mu$ is a probability distribution on $A$. In the rest of the proof $\mu$ refers to this
minimizing distribution. (Note that Theorem 5 in [11] contains an annoying typo: it shows
maximization for $\mu$ instead of minimization.) We are going to use the following
Lemma.

Lemma 2.4. Let us be given real numbers $b_1, b_2, \ldots, b_n$. Assume that

$\sum_{1 \leq i < j \leq n} |b_i - b_j| \geq (n-1)\epsilon$.

Then $\max_i b_i - \frac{1}{n} \sum_{i=1}^n b_i \geq \frac{\epsilon}{n}$.



Proof. Without loss of generality we may assume $b_1 \geq b_2 \geq \ldots \geq b_n$. The conditions
of the Lemma can be rewritten as the conditions of the following primal linear
program:

$b_2 - b_1 \leq 0$,
$b_3 - b_2 \leq 0$,
$\ldots$
$b_n - b_{n-1} \leq 0$,
$\sum_{i<j} (b_j - b_i) \leq -(n-1)\epsilon$,

$\max \left( \frac{1}{n} \sum_{i=1}^n b_i \right) - b_1$.

Recall the Duality Theorem of linear programming: $\max\{c^T x : Mx \leq b\} =
\min\{y^T b : y \geq 0,\ y^T M = c\}$, if both optimization problems have feasible solutions.
The dual linear program is as follows:

$(1-n)x_n - x_1 = \frac{1}{n} - 1$;
$x_i - x_{i+1} + (2i - n + 1)x_n = \frac{1}{n}$ for $i = 1, 2, \ldots, n-2$;
$x_{n-1} + (n-1)x_n = \frac{1}{n}$;
$x_1, x_2, \ldots, x_n \geq 0$;

$\min\ -(n-1)\epsilon\, x_n$.

It is easy to see that for the dual problem a feasible solution is the following
setting: $x_i = 1 - \frac{i(2n-i-1)}{n(n-1)}$ for $i = 1, 2, \ldots, n-1$, and $x_n = \frac{1}{n(n-1)}$, with value $-\frac{\epsilon}{n}$.
This implies that $\frac{\epsilon}{n} \leq \max_j b_j - \frac{1}{n} \sum_{i=1}^n b_i$ for any feasible solution of the primal
problem. □
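
The lemma is easy to probe numerically. The sketch below applies it with $\epsilon$ chosen as large as the hypothesis permits, namely the sum of pairwise gaps divided by $n - 1$, so that the conclusion reads: maximum minus mean is at least the sum of pairwise gaps divided by $n(n-1)$. The helper names and the random test vectors are hypothetical.

```python
from fractions import Fraction as F
from itertools import combinations
import random

def pairwise_gap_sum(b):
    """S = sum over i < j of |b_i - b_j|."""
    return sum(abs(x - y) for x, y in combinations(b, 2))

def max_minus_mean(b):
    return max(b) - sum(b) / len(b)

def lemma_holds(b):
    """Lemma 2.4 with eps = S / (n - 1): max - mean >= S / (n (n - 1))."""
    n = len(b)
    return max_minus_mean(b) >= pairwise_gap_sum(b) / (n * (n - 1))

rng = random.Random(0)
samples = [[F(rng.randint(0, 20), 7) for _ in range(rng.randint(2, 6))]
           for _ in range(200)]
all_ok = all(lemma_holds(b) for b in samples)

# The bound is attained when all numbers but the smallest are equal.
tight = [F(1)] * 4 + [F(0)]
```

Exact rational arithmetic is used so that the equality case is not blurred by rounding.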



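We are going to apply Lemma 2.4 in the following setting. Fix an arbitrary
$u \in U$, and for $i \in A$, let $b_i = \mu(i) P[\xi_i = u]$. The lemma yields

(10) $\max_{a \in A} \Big( \mu(a) P[\xi_a = u] - \frac{1}{|A|} \sum_{i \in A} \mu(i) P[\xi_i = u] \Big)$

(11) $\geq \frac{1}{|A|(|A|-1)} \sum_{1 \leq i < j \leq |A|} \big| \mu(i) P[\xi_i = u] - \mu(j) P[\xi_j = u] \big|$.

Observe the identity

(12) $\sum_{u \in U} \frac{1}{|A|} \sum_{i \in A} \mu(i) P[\xi_i = u] = \frac{1}{|A|} \sum_{i \in A} \mu(i) \sum_{u \in U} P[\xi_i = u] = \frac{1}{|A|}$.

Now identity (12) implies (13), and inequalities (10)-(11) imply inequality (14):

(13) $\min_{a \in A} r_a = \frac{1}{|A|} + \sum_{u \in U} \Big\{ \max_{a \in A} \mu(a) P[\xi_a = u] - \frac{1}{|A|} \sum_{i \in A} \mu(i) P[\xi_i = u] \Big\}$

(14) $\geq \frac{1}{|A|} + \frac{1}{|A|(|A|-1)} \sum_{u \in U} \sum_{1 \leq i < j \leq |A|} \big| \mu(i) P[\xi_i = u] - \mu(j) P[\xi_j = u] \big|$.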

Fix an arbitrary $a, b \in A$, and set $Q = \sum_{u \in U} |\mu(a) P[\xi_a = u] - \mu(b) P[\xi_b = u]|$. Define

$U^+ = \{u \in U : P[\xi_a = u] > P[\xi_b = u]\}$,
$U^= = \{u \in U : P[\xi_a = u] = P[\xi_b = u]\}$,
$U^- = \{u \in U : P[\xi_a = u] < P[\xi_b = u]\}$,

and set $A^+ = \sum_{u \in U^+} P[\xi_a = u]$, $A^- = \sum_{u \in U^-} P[\xi_a = u]$,
$B^+ = \sum_{u \in U^+} P[\xi_b = u]$, $B^- = \sum_{u \in U^-} P[\xi_b = u]$. Observe that

$d(a, b) = \sum_{u \in U} |P[\xi_a = u] - P[\xi_b = u]| = A^+ - B^+ + B^- - A^-$.

On the other hand,

$A^+ + A^- = 1 - \sum_{u \in U^=} P[\xi_a = u] = 1 - \sum_{u \in U^=} P[\xi_b = u] = B^+ + B^-$.

From the last two equations we conclude that $d(a, b) = 2(A^+ - B^+) = 2(B^- - A^-)$.
We finish the proof by setting a lower bound on $Q$ with a case analysis.



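• If $\mu(b) = \mu(a)$, $Q = \mu(a) d(a, b)$.

• If $\mu(b) > \mu(a)$,

$Q \geq \mu(a) \sum_{u \in U^-} \big( P[\xi_b = u] - P[\xi_a = u] \big) = \frac12 \mu(a) d(a, b)$.

• If $\mu(b) < \mu(a)$,

$Q \geq \mu(a) \sum_{u \in U^+} \big( P[\xi_a = u] - P[\xi_b = u] \big) = \frac12 \mu(a) d(a, b)$.

In all cases, we have $Q \geq \frac12 \mu(a) d(a, b)$. Returning to (14), we find

(15) $\sum_{1 \leq i < j \leq |A|} \sum_{u \in U} \big| \mu(i) P[\xi_i = u] - \mu(j) P[\xi_j = u] \big| \geq \frac12 \sum_{a \in A} \mu(a) \sum_{b \in A} d(a, b)$,

and through (13), (14) and (15), we have

$\min_{a \in A} r_a \geq \frac{1}{|A|} + \frac{1}{2|A|(|A|-1)} \sum_{a \in A} \mu(a) \sum_{b \in A} d(a, b)$

$\geq \frac{1}{|A|} + \frac{1}{2|A|(|A|-1)} \min_{a \in A} \sum_{b \in A} d(a, b)$.

□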



3. The parametric setting: Maximum Likelihood Estimation (MLE) 

In this section we reconsider the question of how many i.i.d. samples are required 
in order for parametric maximum likelihood to accurately recover elements of a 
finite set. 

Assume $B = \{(a, \theta) : a \in A, \theta \in \Theta(a)\}$, and $\Xi : B \to U$ is a parametric random
function, where $A$ and $U$ are finite sets. Define

(16) $U^+ := \{u : P[\xi_{(a,\theta)} = u] > 0\}$,

(17) $\alpha := \alpha_{(a,\theta)} = \min_{u \in U^+} \{P[\xi_{(a,\theta)} = u]\}$,

and assume

(18) $d := d_{(a,\theta)} = \inf_{b \neq a,\ \theta' \in \Theta(b)} \sum_{u \in U} |P[\xi_{(a,\theta)} = u] - P[\xi_{(b,\theta')} = u]| > 0$.

In our earlier work, Theorem 5 in [12], we showed that for

(19) $k \geq f(\alpha, d) \log\left(\frac{|U^+|}{\epsilon}\right)$,

$k$ samples suffice to reconstruct $a \in A$ from $(a, \theta)$ with probability at least $1 - \epsilon$
using MLE; more formally, for $\Xi^{(k)} : B \to U^k$, $[R^{(k)}]'_{(a,\theta)} \geq 1 - \epsilon$. Our function $f$ in
(19) tends to infinity when either (or both) $\alpha \to 0$ or $d \to 0$. This dependence on $d$ is
reasonable (though not always necessary, see Section 3.2); however the dependence
on $\alpha$ is not clear, and raises two questions.

Q1 Is there a bound on $k$ (like (19)) but which depends only on $\epsilon$ and $d$
and not on $\alpha$?

Q2 Moreover, can the function $f$ in (19) be replaced by a function of just $d$ and
$\epsilon$ (and not $\alpha$ and $U^+$) so that the resulting function is still a valid bound
for $k$?

In this section we show that the answer to the first question is 'yes' (Theorem 3.3)
while the answer to the second is 'no' (Example 3.1).

We begin by introducing some further notation. For any two probability distributions
$p, p'$ on a set $U$ let $d_{KL}(p, p') = \sum_{u \in U : p_u > 0} p_u \log\left(\frac{p_u}{p'_u}\right) \in [0, \infty) \cup \{\infty\}$
denote the Kullback-Leibler distance of $p$ and $p'$, and recall the standard inequality:

(20) $d_{KL}(p, p') \geq \frac12 d(p, p')^2$,

where $d(p, p')$ denotes as usual the variational distance, $\sum_{u \in U} |p_u - p'_u|$. We will
also use $d_2(p, p') = \left( \sum_{u \in U} |p_u - p'_u|^2 \right)^{1/2}$.
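
The three distances and inequality (20) can be checked numerically. The sketch below uses natural logarithms and randomly generated distributions; the function names and the random test data are hypothetical. It also checks the Cauchy-Schwarz relation between the variational distance and $d_2$ that is used later.

```python
import math
import random

def d_var(p, q):
    """Variational distance: sum_u |p_u - q_u| (no factor 1/2)."""
    return sum(abs(x - y) for x, y in zip(p, q))

def d_kl(p, q):
    """Kullback-Leibler distance of p from q, natural logarithm; may be infinite."""
    total = 0.0
    for x, y in zip(p, q):
        if x > 0:
            if y == 0:
                return math.inf
            total += x * math.log(x / y)
    return total

def d_2(p, q):
    """Euclidean distance between the two probability vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def random_distribution(rng, n):
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(1)
pairs = [(random_distribution(rng, 5), random_distribution(rng, 5))
         for _ in range(1000)]
# Inequality (20), and d <= sqrt(|U|) * d_2 by Cauchy-Schwarz.
pinsker_ok = all(d_kl(p, q) >= 0.5 * d_var(p, q) ** 2 - 1e-12 for p, q in pairs)
cauchy_schwarz_ok = all(d_var(p, q) <= math.sqrt(5) * d_2(p, q) + 1e-12
                        for p, q in pairs)
```

For instance, with $p = (1/2, 1/2)$ and $p' = (1/4, 3/4)$ the variational distance is $1/2$ and the Kullback-Leibler distance is $\frac12 \log(4/3)$, which exceeds $\frac12 \cdot (1/2)^2$.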

Lemma 3.1. Let $X_1, X_2, \ldots, X_k$ be a sequence of i.i.d. random variables taking
values in a finite set $U$. Assume further that if $X_i$ takes a value with probability
zero, then it never takes this value. For each $u \in U$, let $\hat p_u := \frac{1}{k} \sum_{i=1}^k 1(X_i = u)$
(the normalized multinomial counts) and let $p_u = P[X_i = u]$. Let $U^+ := \{u : p_u > 0\}$.
Then,

(i) $P[d_{KL}(\hat p, p) \geq \delta] \leq \frac{|U^+| - 1}{k\delta}$,

(ii) $P[d(\hat p, p) \geq \delta] \leq \frac{|U^+|}{k\delta^2}$.



Proof. Part (i). Let $\Delta_u = \hat p_u - p_u$. For $u \in U^+$, set $Q_u = 0$ if $\hat p_u = 0$, while if
$\hat p_u > 0$ set

$Q_u := \hat p_u \log\left(\frac{\hat p_u}{p_u}\right) = (p_u + \Delta_u) \log\left(1 + \frac{\Delta_u}{p_u}\right)$

(21) $\leq (p_u + \Delta_u) \frac{\Delta_u}{p_u} = \Delta_u + \frac{\Delta_u^2}{p_u}$.

Recall Markov's inequality, which states that if $X$ is a non-negative random variable,
and $a > 0$, then

(22) $P[X \geq a] \leq \frac{E[X]}{a}$.

Note that $E[(\hat p_u - p_u)^2] = Var[\hat p_u] = \frac{p_u(1 - p_u)}{k}$, and applying (22) to

$X = \sum_{u \in U^+} \frac{\Delta_u^2}{p_u}$

and noting that $E[X] = \frac{|U^+| - 1}{k}$ gives $P[X \geq \delta] \leq \frac{|U^+| - 1}{k\delta}$. By definition,
$d_{KL}(\hat p, p) = \sum_{u : \hat p_u > 0} Q_u = \sum_{u \in U^+} Q_u$, and this is less than or equal to $X$ (by (21), and
the identity $\sum_{u \in U^+} \Delta_u = 0$), which leads to the required inequality.

Part (ii). By the Cauchy-Schwarz inequality $d(\hat p, p)^2 \leq d_2^2(\hat p, p) \cdot |U^+|$ and so,

$P[d(\hat p, p) \geq \delta] \leq P[d_2^2(\hat p, p) \geq \delta^2 / |U^+|] \leq \frac{|U^+|}{\delta^2} E[d_2^2(\hat p, p)]$

by Markov's inequality (22). Now,

$E[d_2^2(\hat p, p)] = E\Big[\sum_{u \in U} (\hat p_u - p_u)^2\Big] = \sum_{u \in U} Var[\hat p_u] = \sum_{u \in U} \frac{p_u(1 - p_u)}{k} \leq \frac{1}{k}$.

□
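
For a small alphabet and small $k$, the quantities in this proof can be computed exactly by enumerating all samples. The sketch below does this for a hypothetical distribution on three values; the bounds it compares against are the Markov-type bounds $(|U^+| - 1)/(k\delta)$ and $|U^+|/(k\delta^2)$ that the two parts of the proof yield, and the choice of `k` and of the two thresholds is arbitrary.

```python
from fractions import Fraction as F
from itertools import product
from math import log

def outcomes(p, k):
    """Yield (probability, empirical distribution) for every length-k sample."""
    for seq in product(range(len(p)), repeat=k):
        prob = F(1)
        for u in seq:
            prob *= p[u]
        yield prob, [F(seq.count(u), k) for u in range(len(p))]

def kl(p_hat, p):
    """Kullback-Leibler distance of p_hat from p (natural logarithm)."""
    return sum(float(x) * log(float(x / y)) for x, y in zip(p_hat, p) if x > 0)

def l1(p_hat, p):
    return sum(abs(x - y) for x, y in zip(p_hat, p))

p = [F(1, 2), F(1, 3), F(1, 6)]    # hypothetical distribution, |U+| = 3
k = 6
delta_kl, delta_l1 = F(1, 2), F(9, 10)

# E[X] for X = sum_u Delta_u^2 / p_u, computed exactly.
expected_X = sum(prob * sum((x - y) ** 2 / y for x, y in zip(ph, p))
                 for prob, ph in outcomes(p, k))
tail_kl = sum(prob for prob, ph in outcomes(p, k) if kl(ph, p) >= delta_kl)
tail_l1 = sum(prob for prob, ph in outcomes(p, k) if l1(ph, p) >= delta_l1)
bound_kl = F(len(p) - 1) / (k * delta_kl)       # (|U+| - 1) / (k delta)
bound_l1 = F(len(p)) / (k * delta_l1 ** 2)      # |U+| / (k delta^2)
```

The exact expectation of $X$ equals $(|U^+| - 1)/k$, and both tail probabilities sit below the corresponding bounds, usually with a wide margin since Markov's inequality is crude.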



Corollary 3.2. Under the assumptions of Lemma 3.1, if $\delta < 1$, $\epsilon > 0$ and
$k \geq \frac{2|U^+|}{\delta^2 \epsilon}$, then with probability at least $1 - \epsilon$, the inequalities $d_{KL}(\hat p, p) < \delta$ and
$d(\hat p, p) < \delta$ simultaneously hold.

Theorem 3.3. Assume $B = \{(a, \theta) : a \in A, \theta \in \Theta(a)\}$, and $\Xi : B \to U$ is a
parametric random function, where $A$ and $U$ are finite sets. Recall definition (16)
and condition (18). Provided $k \geq \frac{|U^+|}{c_1 d_{(a,\theta)}^4 \epsilon}$ with $c_1 = \frac{(2 - \sqrt{3})^2}{2}$, the probability that
MLE correctly returns $a$ from $\xi^{(k)}_{(a,\theta)}$ is at least $1 - \epsilon$, i.e. $[R^{(k)}]'_{(a,\theta)} \geq 1 - \epsilon$.

Proof. Let $p$ be the probability distribution on $U$ induced by $\xi_{(a,\theta)}$, let $c = 2 - \sqrt{3}$, and
let $E$ be the event that $d(\hat p, p) < c \cdot d_{(a,\theta)}$. For the probability distribution $q$ induced by
$\xi_{(b,\theta')}$ where $b \neq a$, by the triangle inequality we have

$d(\hat p, q) \geq |d(p, q) - d(\hat p, p)|$.

Now, by assumption $d(p, q) \geq d_{(a,\theta)}$, and so, conditional on $E$, $d(\hat p, q) \geq (1 - c) d_{(a,\theta)}$.
Invoking the inequality (20) gives

$d_{KL}(\hat p, q) \geq \frac12 d(\hat p, q)^2 \geq \frac12 (1 - c)^2 d_{(a,\theta)}^2$.

Thus, conditional on $E$ we have:

(23) $\sum_{u \in U^+} \hat p_u \log q_u - \sum_{u \in U^+} \hat p_u \log \hat p_u \leq -\frac12 (1 - c)^2 d_{(a,\theta)}^2$.

For $x \in A$, $\omega \in \Theta(x)$, consider

(24) $L(x, \omega) = \sum_{u \in U^+} \hat p_u \log P[\xi_{(x,\omega)} = u]$.

$L(x, \omega)$ is $\frac{1}{k}$ times the natural logarithm of the probability that the observed
sequence of $U$-elements came from $(x, \omega)$. Therefore $L(x, \omega) \leq 0$ is proportional to
the log-likelihood of $(x, \omega)$. Now consider the log likelihood ratio

$\Delta L := L(a, \theta) - L(b, \theta') = \sum_{u \in U^+} \hat p_u \log(p_u / q_u)$.

Conditional on $E$ we have, by (23),

(25) $\Delta L \geq -\sum_{u \in U^+} \hat p_u \log\left(\frac{\hat p_u}{p_u}\right) + \frac12 (1 - c)^2 d_{(a,\theta)}^2 = \frac12 (1 - c)^2 d_{(a,\theta)}^2 - d_{KL}(\hat p, p)$.

So if we select $\delta = c \cdot d_{(a,\theta)}^2$ in Corollary 3.2 we can ensure that with probability at
least $1 - \epsilon$ the event $E$ occurs and also (since $\frac12 (1 - c)^2 = c$) that
$d_{KL}(\hat p, p) < \delta = c \cdot d_{(a,\theta)}^2 = \frac12 (1 - c)^2 d_{(a,\theta)}^2$, and so, by (25), we have $\Delta L > 0$. The value of $k$ that
Corollary 3.2 requires is precisely that given in the statement of this theorem. This
completes the proof. □
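
To get a feel for the size of the bound, the following sketch evaluates the sample size obtained by combining the two Markov-type tail bounds of Lemma 3.1 in a union bound and then substituting $\delta = c \cdot d^2$. It assumes the resulting bound has the form $k \geq 2|U^+| / (\delta^2 \epsilon)$; this form, the function names and the numerical inputs are assumptions made for illustration.

```python
import math

# c is the root in (0, 1) of (1 - c)^2 / 2 = c, that is, c = 2 - sqrt(3).
C = 2 - math.sqrt(3)

def corollary_samples(support_size, delta, eps):
    """Smallest integer k with k >= 2 |U+| / (delta^2 eps) (assumed form)."""
    return math.ceil(2 * support_size / (delta ** 2 * eps))

def theorem_samples(support_size, d, eps):
    """Sample size after substituting delta = c * d^2."""
    return corollary_samples(support_size, C * d ** 2, eps)

k_base = theorem_samples(16, 0.5, 0.05)
k_double_support = theorem_samples(32, 0.5, 0.05)   # grows linearly in |U+|
k_half_distance = theorem_samples(16, 0.25, 0.05)   # grows like d^(-4)
```

Doubling the support size doubles the bound, while halving the separation $d$ multiplies it by 16, which is the fourth-power dependence discussed in the remarks that follow.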

Remarks 

• Theorem 3.3 also implies that for MLE in the non-parametric setting, the
number $k$ of i.i.d. samples required to reconstruct an element $a \in A$ correctly
with probability at least $1 - \epsilon$ is bounded above by a function that
depends just on $|U^+|$, $\epsilon$ and $d_a := \min_{b \neq a} d(a, b)$. In [11] an upper bound on
$k$ was also derived, however it depended just on $|A|$, $\epsilon$ and $d_a$. Comparing
these results suggests an interesting question: Is there an upper bound for
$k$ (in the non-parametric setting) which depends just on $d_a$ and $\epsilon$?

• We show below that the linear dependence of $k$ on $|U^+|$ in Theorem 3.3
is best possible in the sense that no sublinear dependence is possible. It
is possible however that the exponent of 4 for $d$ in Theorem 3.3 might be
reduced.

3.1. Construction to show that $k$ must grow linearly with $|U^+|$. We now
show that Theorem 3.3 cannot be improved by replacing the dependence of $k$ on
$|U^+|$ with a sublinear function (like the logarithmic dependence on $|U^+|$ in Theorem
5.1 of [12]) even when $d_{(a,\theta)}$ and $\epsilon$ are held constant.

Let $A = \{a, b\}$, with $\Theta(a) = \{*\}$, and

$\Theta(b) = \{\theta = (\lambda_1, \ldots, \lambda_n) : \sum_{i=1}^n \lambda_i = 1,\ \forall i\ \lambda_i \geq 0\}$.

Let $U = \{0, 1, \ldots, n\}$. Fix $\delta > 0$ and consider the random function $\Xi$ defined as
follows.

$P[\xi_{(a,*)} = u] = \delta$ if $u = 0$; $\frac{1-\delta}{n}$ if $u \in \{1, \ldots, n\}$;

$P[\xi_{(b,(\lambda_1,\ldots,\lambda_n))} = u] = 2\delta$ if $u = 0$; $\lambda_u (1 - 2\delta)$ if $u \in \{1, \ldots, n\}$.

We assume that $k \leq n$, otherwise we have nothing to prove. For $\mathbf{u} = (u_1, \ldots, u_k) \in U^k$,
let $x(\mathbf{u}) = |\{i \in \{1, \ldots, k\} : u_i = 0\}|$. We have:

$L_1 := \sup_{\theta \in \Theta(a)} P[\xi^{(k)}_{(a,\theta)} = \mathbf{u}] = \delta^{x(\mathbf{u})} \left( \frac{1-\delta}{n} \right)^{k - x(\mathbf{u})}$

and

(26) $L_2 := \sup_{\theta \in \Theta(b)} P[\xi^{(k)}_{(b,\theta)} = \mathbf{u}] \geq (2\delta)^{x(\mathbf{u})} \left( \frac{1-2\delta}{k - x(\mathbf{u})} \right)^{k - x(\mathbf{u})}$,

since we are free to select $\theta \in \Theta(b)$ to be the uniform distribution on the values $u_i$ for
those $i$ for which $u_i \neq 0$. We will select $\delta$ sufficiently small that

(27) $2(1 - 2\delta)^{1 - \delta/2} > 1$.

Now, suppose we generate $\mathbf{u}$ randomly from $(a, *)$. Note that the value of $d_{(a,*)}$ is
at least $\delta$, since

$d((a, *), (b, \theta')) \geq |P[\xi_{(a,*)} = 0] - P[\xi_{(b,\theta')} = 0]| = \delta$.

Then MLE will (incorrectly) reconstruct $b$ whenever $R := L_2 / L_1 > 1$. We will
show that this occurs with probability at least $1 - \epsilon$, if $k$ is less than $\frac12 |U^+|$, for any
$\delta$ satisfying (27) and any sufficiently large $|U^+|$.

Note that by replacing $L_2$ by its lower bound (26), we can write $R \geq Y^k$ where

$Y = 2^\rho \left[ \frac{n}{k} \cdot \frac{1 - 2\delta}{(1 - \delta)(1 - \rho)} \right]^{1 - \rho}$,

where $\rho := x(\mathbf{u})/k$. Now, if $k \leq \frac12 n$, then since $((1 - \delta)(1 - \rho))^{-(1-\rho)} \geq 1$,

$Y \geq 2(1 - 2\delta)^{1 - \rho}$.

Now, for $\delta, \epsilon$ fixed, there exists a value of $k$ for which, with probability at least
$1 - \epsilon$, we have $\rho \geq \frac12 \delta$. Thus for this value of $k$, and any $n \geq 2k$, inequality (27)
gives

$Y \geq 2(1 - 2\delta)^{1 - \delta/2} > 1$,

and so $R > 1$; that is, MLE will make an incorrect decision. Thus, we must have
$k > \frac12 n = \frac12 (|U^+| - 1)$ in order to avoid this.
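
The failure of MLE in this construction can be observed directly by computing the log-likelihood ratio for a given sample. The sketch below assumes the reading of the construction in which, under $a$, the value 0 has probability $\delta$ and the other $n$ values are uniform, while under $b$ the value 0 has probability $2\delta$ and the remaining mass is spread according to $\lambda$. The supremum over $\lambda$ is attained at the empirical frequencies of the non-zero observations, which is the standard multinomial maximizer. The function name and the two samples are hypothetical.

```python
import math
from collections import Counter

def log_likelihood_ratio(u, n, delta):
    """log(L2 / L1) for an observed sequence u over U = {0, 1, ..., n}.

    L1: likelihood of u under (a, *).
    L2: supremum over lambda of the likelihood under b, attained when
        lambda_v is proportional to the number of times v was observed.
    """
    k = len(u)
    x = sum(1 for v in u if v == 0)                  # number of zeros
    counts = Counter(v for v in u if v != 0)
    log_l1 = x * math.log(delta) + (k - x) * math.log((1 - delta) / n)
    log_l2 = x * math.log(2 * delta) + (k - x) * math.log(1 - 2 * delta)
    log_l2 += sum(c * math.log(c / (k - x)) for c in counts.values())
    return log_l2 - log_l1

# Few samples relative to n: a typical sample from (a, *) has distinct
# non-zero values, and b fits it better.
few = [0] * 4 + list(range(1, 37))                   # k = 40, n = 100
ratio_few = log_likelihood_ratio(few, n=100, delta=0.1)

# Many samples relative to n: frequencies near those of (a, *) favour a.
many = [0] * 20 + [1] * 90 + [2] * 90                # k = 200, n = 2
ratio_many = log_likelihood_ratio(many, n=2, delta=0.1)
```

A positive value means that MLE selects $b$ even though the data came from $a$; this happens in the first sample and not in the second.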

3.2. Example to show that parametric MLE can still succeed when variational
distance vanishes on each element of $A$. In the non-parametric setting,
given a random function $\Xi : A \to U$, suppose that $d(a_1, a_2) = 0$ for two elements
$a_1, a_2 \in A$. Then for any random function $\Gamma : U \to A$ it is easily shown (e.g. by
Theorem 3.1 of [11]) that

(28) $\min\{P[\gamma_{\xi_{a_1}} = a_1], P[\gamma_{\xi_{a_2}} = a_2]\} \leq \frac12$.

That is, if the probability distribution induced by $a_1$ and $a_2$ is the same, no method
can recover both $a_1$ and $a_2$ more accurately than by a toss of a fair coin. We can
ask if a similar result holds for parametric MLE. That is, suppose that $A = \{a_1, a_2\}$
and for a value $\theta_1 \in \Theta(a_1)$, and $\theta_2 \in \Theta(a_2)$ we have

(29) $d_{(a_1,\theta_1)} = d_{(a_2,\theta_2)} = 0$,

where $d_{(a,\theta)}$ is defined as in (18). Note that Theorem 3.3 does not give a finite
bound on $k$ for MLE to accurately reconstruct $a_1$ or $a_2$. However it turns out that
for certain random functions satisfying (29), if parametric MLE is used to estimate
$a_1$ and $a_2$ from $k$ independent trials, then for any parameter $(a_1, \theta_1)$ chosen, and for
even $k$, the probability that the selection is correct is always strictly greater than
$\frac12$; moreover in all but one choice of the parameter settings (for $a_1$) the probability
the selection is correct tends to 1 as $k \to \infty$ (in the other setting it tends to $\frac12$
from above). For this example there is a more pedestrian approach for estimating
$a_1$ or $a_2$ from the $k$ independent trials, for which the probability of making the
correct reconstruction tends to 1 as $k$ tends to infinity, for all parameter settings
(in contrast to MLE which has problems at one particular parameter setting - this
illustrates again the care required in consistency arguments for MLE). Note also
that in this example, with any parameters $\theta_1, \theta_2$, $d((a_1, \theta_1), (a_2, \theta_2)) > 0$ holds.

Let $A = \{a_1, a_2\}$, $U = \{(1, 0), (1, 1), (2, 0), (2, 1)\}$, $\Theta(a_1) = [\pi/4, 3\pi/4)$, and
$\Theta(a_2) = (\pi/4, 3\pi/4]$. For $t \in \Theta(a_1)$, let $P[\xi_{(a_1,t)} = (1, \lfloor 2t/\pi \rfloor)] = \sin^2 t$,
$P[\xi_{(a_1,t)} = (2, \lfloor 2t/\pi \rfloor)] = \cos^2 t$; and for $t \in \Theta(a_2)$, let $P[\xi_{(a_2,t)} = (1, \lfloor 2t/\pi \rfloor)] = \cos^2 t$,
$P[\xi_{(a_2,t)} = (2, \lfloor 2t/\pi \rfloor)] = \sin^2 t$.

The key observation for the argument that follows is that $\sin^2 t > \cos^2 t$ in
$(\pi/4, 3\pi/4)$, while at the endpoints $\sin^2 t = 1/2 = \cos^2 t$. It is easy to see that
$\lim_{t \to \pi/4^+} d((a_1, \pi/4), (a_2, t)) = 0$, and hence $d_{(a_1,\pi/4)} = 0$. A similar argument
shows that $d_{(a_2,3\pi/4)} = 0$. It is also easy to see that the distributions of all $\xi_{(a_i,t)}$
random variables are different. The only possible problem would be the distributions
of $\xi_{(a_1,\pi/4)}$ and $\xi_{(a_2,3\pi/4)}$ - however in this case we have the second coordinates
in the elements of $U$ to separate these distributions. There is a pedestrian way to
guess where an element of $U$ came from. Count the ones and twos in the first
coordinates after $k$ independent trials. If there are more ones, then select $a_1$; if
there are more twos then select $a_2$; while in the case of a tie, if $\lfloor 2t/\pi \rfloor = 0$, then
select $a_1$, otherwise select $a_2$ (note that $\lfloor 2t/\pi \rfloor$ is constant over the trials).
MLE pretty much does the same; the only thing that requires more careful analysis
is whether MLE correctly returns $(a_1, \pi/4)$ and $(a_2, 3\pi/4)$. Focus on $(a_1, \pi/4)$,
as the other problem is analogous. Let #1 and #2 denote the number of ones
and twos in the first coordinates in $\xi^{(k)}_{(a_1,\pi/4)}$. Let $p$ be the probability of the event
$X_1$ = "#1 > #2"; by symmetry it is also the probability of the event $X_2$ = "#1 < #2",
and let $q$ be the probability of the event $X_3$ = "#1 = #2". Note
that MLE correctly returns $a_1$ for events $X_1$ and $X_3$ (but not for $X_2$), and hence
$[R^{(k)}]'_{(a_1,\pi/4)} = p + q > \frac12$. The claim holds for $X_3$ for the following reason.
The probability that $\xi^{(k)}_{(a_1,\pi/4)}$ yields the particular observed $k$-sequence conditional
on $X_3$ is $2^{-k}$, while the probability that $(a_2, \theta_2)$ generated the particular observed
$k$-sequence conditional on event $X_3$ is $\rho^{k/2} (1 - \rho)^{k/2}$ for some $\rho \neq 1/2$, and this
second probability is strictly smaller than $2^{-k}$.
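
The success probability $p + q$ at the critical parameter $(a_1, \pi/4)$ can be computed exactly, since the first coordinates are then fair coin flips. The sketch below is a direct evaluation under that observation; the function name is hypothetical.

```python
from fractions import Fraction as F
from math import comb

def mle_success_at_quarter_pi(k):
    """p + q for k (even) trials of xi_(a1, pi/4).

    q = P[#1 = #2] for k fair coin flips; p = P[#1 > #2] = (1 - q) / 2
    by symmetry. MLE is correct on the events X_1 and X_3.
    """
    q = F(comb(k, k // 2), 2 ** k)
    p = (1 - q) / 2
    return p + q

values = [mle_success_at_quarter_pi(k) for k in (2, 4, 10, 100)]
```

The values decrease towards 1/2 while staying strictly above it, in line with the statement that at this one parameter setting the success probability tends to 1/2 from above.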

Informally, the reason for this phenomenon is that the parameter space associated
with $a_i$ is tuned for 'fitting' data that is produced by the pair $(a_i, \theta_i)$.

Despite this somewhat surprising result, one can easily derive a parametric analogue
of (28) for any random function $\Xi : B \to U$ (where $B = \{(a, \theta) : \theta \in \Theta(a)\}$
as usual) under the stronger condition that $d((a_1, \theta_1), (a_2, \theta_2)) = 0$, where
$d((a_1, \theta_1), (a_2, \theta_2))$ is the variational distance between the distributions of the
$U$-valued random variables $\xi_{(a_1,\theta_1)}$ and $\xi_{(a_2,\theta_2)}$. In this case, for any random function
$\Gamma : U \to A$ (not just parametric MLE) that is independent of $\Xi$ it is easily shown that

$\min\{P[\gamma_{\xi_{(a_1,\theta_1)}} = a_1], P[\gamma_{\xi_{(a_2,\theta_2)}} = a_2]\} \leq \frac12$.

Of course this bound applies also for $k$ i.i.d. trial experiments.

3.3. Application of Theorem 3.3. As a simple illustration of the use of Theorem 3.3
we describe an application to the reconstruction of phylogenetic trees from
binary sequences according to a simple Markov process (the CFN model). Such
processes are central to much of molecular biology (see e.g. [5]). Let $A$ denote
the three binary phylogenetic trees that have leaf set $X = \{1, 2, 3, 4\}$. For a tree
$T = (V_T, E_T) \in A$, $\Theta(T)$ is the set of functions $p : E_T \to [0, 0.5]$ which assign to
each edge $e$ of $T$ an associated substitution probability. Under the CFN model a
state is assigned uniformly at random to a leaf (e.g. leaf 1) and states are assigned
recursively to the remaining vertices of the tree by (independently) changing the
state (0 to 1 or 1 to 0) across each edge $e$ of $T$ with probability $p(e)$. This gives a
(marginal) probability distribution on each of the 16 site patterns $c : X \to \{0, 1\}$
(further details concerning this model can be found in [12] or [11]). Thus if we generate
$k$ site patterns i.i.d. from the pair $(T, p)$ we can ask how large $k$ must be in
order for MLE to accurately reconstruct $T$. To ensure that $d_{(T,p)} > 0$ one must
impose the following condition on $p$.

(P) For each of the four edges $e$ of $T$ incident with a leaf we have $p(e) \leq g < \frac12$;
and for the central edge $e$ of $T$, $p(e) \geq f > 0$.

From [13] (Lemma 6.3) we have $d_{(T,p)} \geq H(f, g) > 0$ for a continuous function
$H$. Note that condition (P) can allow arbitrarily small values for
$\alpha_{(T,p)} := \min_{u \in U^+} \{P[\xi_{(T,p)} = u]\}$ even when $f$ and $g$ take fixed values (since condition (P)
allows two adjacent edges incident with leaves of $T$ to both have arbitrarily small
$p(e)$ values, and the probability of any site pattern that assigns these two leaves
different states can therefore be made as close to zero as we wish). Consequently,
the main result from [12] does not provide any (finite) estimate for the number of site
patterns required for MLE to correctly reconstruct a tree. However we may apply
Theorem 3.3 in this setting, and since $|U^+| \leq 16$, we obtain an explicit upper bound on
the number of site patterns required to reconstruct each phylogenetic tree on four
leaves correctly with probability at least $1 - \epsilon$.
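
The site pattern distribution of the CFN model on a four-leaf tree is small enough to compute exactly. The sketch below sums over the states of the two internal vertices, placing the uniform root state at an internal vertex; since the substitution process is symmetric, this gives the same marginal distribution as rooting at a leaf. The function name, the encoding of a tree by its two cherries, and the numerical edge probabilities are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

def cfn_quartet(split, leaf_p, central_p):
    """Exact site-pattern distribution of the CFN model on a quartet tree.

    split = ((i, j), (k, l)): leaves i, j are attached to internal vertex x,
    leaves k, l to internal vertex y; x and y are joined by the central edge.
    leaf_p[leaf] and central_p are the substitution probabilities.
    """
    (i, j), (k, l) = split

    def edge(p, changed):
        return p if changed else 1 - p

    dist = {}
    for c in product((0, 1), repeat=4):          # c[leaf - 1] = state at leaf
        total = F(0)
        for sx, sy in product((0, 1), repeat=2):
            term = F(1, 2) * edge(central_p, sx != sy)
            for leaf, s in ((i, sx), (j, sx), (k, sy), (l, sy)):
                term *= edge(leaf_p[leaf], c[leaf - 1] != s)
            total += term
        dist[c] = total
    return dist

leaf_p = {leaf: F(1, 10) for leaf in (1, 2, 3, 4)}
dist_12_34 = cfn_quartet(((1, 2), (3, 4)), leaf_p, F(1, 5))
dist_13_24 = cfn_quartet(((1, 3), (2, 4)), leaf_p, F(1, 5))
d = sum(abs(dist_12_34[c] - dist_13_24[c]) for c in dist_12_34)

# Two adjacent leaf edges with tiny probabilities: condition (P) still holds
# for fixed f and g, but the smallest pattern probability becomes tiny.
short = {1: F(1, 1000), 2: F(1, 1000), 3: F(1, 10), 4: F(1, 10)}
alpha = min(cfn_quartet(((1, 2), (3, 4)), short, F(1, 5)).values())
```

With all edge probabilities strictly between 0 and 1/2 all 16 patterns have positive probability, so the support size is 16, while the smallest pattern probability can be driven towards zero without violating condition (P).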